Statlog (Vehicle Silhouettes) Data Set

The purpose is to classify a given silhouette as one of four types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.

Attribute Information:

  • COMPACTNESS (average perim)**2/area
  • CIRCULARITY (average radius)**2/area
  • DISTANCE CIRCULARITY area/(av.distance from border)**2
  • RADIUS RATIO (max.rad-min.rad)/av.radius
  • PR.AXIS ASPECT RATIO (minor axis)/(major axis)
  • MAX.LENGTH ASPECT RATIO (length perp. max length)/(max length)
  • SCATTER RATIO (inertia about minor axis)/(inertia about major axis)
  • ELONGATEDNESS area/(shrink width)**2
  • PR.AXIS RECTANGULARITY area/(pr.axis length*pr.axis width)
  • MAX.LENGTH RECTANGULARITY area/(max.length*length perp. to this)
  • SCALED VARIANCE (2nd order moment about minor axis)/area ALONG MAJOR AXIS
  • SCALED VARIANCE (2nd order moment about major axis)/area ALONG MINOR AXIS
  • SCALED RADIUS OF GYRATION (mavar+mivar)/area
  • SKEWNESS ABOUT (3rd order moment about major axis)/sigma_min**3 MAJOR AXIS
  • SKEWNESS ABOUT (3rd order moment about minor axis)/sigma_maj**3 MINOR AXIS
  • KURTOSIS ABOUT (4th order moment about major axis)/sigma_min**4 MINOR AXIS
  • KURTOSIS ABOUT (4th order moment about minor axis)/sigma_maj**4 MAJOR AXIS
  • HOLLOWS RATIO (area of hollows)/(area of bounding polygon)

    Where sigma_maj2 is the variance along the major axis and sigma_min2 is the variance along the minor axis, and area of hollows= area of bounding poly-area of object

  • NUMBER OF CLASSES - 3 CAR, BUS, VAN

Import necessary modules

In [1]:
#Import all the necessary modules
import pandas as pd
import numpy as np
import os
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

#!jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10

from scipy.stats import zscore
from sklearn.decomposition import PCA


from IPython.core.interactiveshell import InteractiveShell 
InteractiveShell.ast_nodeinteractivity = 'all'

Read the dataset and split into independent and dependents variables

In [2]:
df = pd.read_csv("vehicle.csv")
target = 'class'
X = df.loc[:, df.columns!=target]
y = df.loc[:, df.columns==target]

List first few rows of the dataset

In [3]:
X.head()
Out[3]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio
0 95 48.0 83.0 178.0 72.0 10 162.0 42.0 20.0 159 176.0 379.0 184.0 70.0 6.0 16.0 187.0 197
1 91 41.0 84.0 141.0 57.0 9 149.0 45.0 19.0 143 170.0 330.0 158.0 72.0 9.0 14.0 189.0 199
2 104 50.0 106.0 209.0 66.0 10 207.0 32.0 23.0 158 223.0 635.0 220.0 73.0 14.0 9.0 188.0 196
3 93 41.0 82.0 159.0 63.0 9 144.0 46.0 19.0 143 160.0 309.0 127.0 63.0 6.0 10.0 199.0 207
4 85 44.0 70.0 205.0 103.0 52 149.0 45.0 19.0 144 241.0 325.0 188.0 127.0 9.0 11.0 180.0 183

View the descriptive statistics of the dataset using describe() function

In [4]:
X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 846 entries, 0 to 845
Data columns (total 18 columns):
compactness                    846 non-null int64
circularity                    841 non-null float64
distance_circularity           842 non-null float64
radius_ratio                   840 non-null float64
pr.axis_aspect_ratio           844 non-null float64
max.length_aspect_ratio        846 non-null int64
scatter_ratio                  845 non-null float64
elongatedness                  845 non-null float64
pr.axis_rectangularity         843 non-null float64
max.length_rectangularity      846 non-null int64
scaled_variance                843 non-null float64
scaled_variance.1              844 non-null float64
scaled_radius_of_gyration      844 non-null float64
scaled_radius_of_gyration.1    842 non-null float64
skewness_about                 840 non-null float64
skewness_about.1               845 non-null float64
skewness_about.2               845 non-null float64
hollows_ratio                  846 non-null int64
dtypes: float64(14), int64(4)
memory usage: 119.0 KB
In [5]:
X.describe().T
Out[5]:
count mean std min 25% 50% 75% max
compactness 846.0 93.678487 8.234474 73.0 87.00 93.0 100.0 119.0
circularity 841.0 44.828775 6.152172 33.0 40.00 44.0 49.0 59.0
distance_circularity 842.0 82.110451 15.778292 40.0 70.00 80.0 98.0 112.0
radius_ratio 840.0 168.888095 33.520198 104.0 141.00 167.0 195.0 333.0
pr.axis_aspect_ratio 844.0 61.678910 7.891463 47.0 57.00 61.0 65.0 138.0
max.length_aspect_ratio 846.0 8.567376 4.601217 2.0 7.00 8.0 10.0 55.0
scatter_ratio 845.0 168.901775 33.214848 112.0 147.00 157.0 198.0 265.0
elongatedness 845.0 40.933728 7.816186 26.0 33.00 43.0 46.0 61.0
pr.axis_rectangularity 843.0 20.582444 2.592933 17.0 19.00 20.0 23.0 29.0
max.length_rectangularity 846.0 147.998818 14.515652 118.0 137.00 146.0 159.0 188.0
scaled_variance 843.0 188.631079 31.411004 130.0 167.00 179.0 217.0 320.0
scaled_variance.1 844.0 439.494076 176.666903 184.0 318.00 363.5 587.0 1018.0
scaled_radius_of_gyration 844.0 174.709716 32.584808 109.0 149.00 173.5 198.0 268.0
scaled_radius_of_gyration.1 842.0 72.447743 7.486190 59.0 67.00 71.5 75.0 135.0
skewness_about 840.0 6.364286 4.920649 0.0 2.00 6.0 9.0 22.0
skewness_about.1 845.0 12.602367 8.936081 0.0 5.00 11.0 19.0 41.0
skewness_about.2 845.0 188.919527 6.155809 176.0 184.00 188.0 193.0 206.0
hollows_ratio 846.0 195.632388 7.438797 181.0 190.25 197.0 201.0 211.0
In [6]:
def basic_details(df):
    b = pd.DataFrame()
    b['Missing value'] = df.isnull().sum()
    b['N unique value'] = df.nunique()
    b['dtype'] = df.dtypes
    return b
In [7]:
basic_details(X)
Out[7]:
Missing value N unique value dtype
compactness 0 44 int64
circularity 5 27 float64
distance_circularity 4 63 float64
radius_ratio 6 134 float64
pr.axis_aspect_ratio 2 37 float64
max.length_aspect_ratio 0 21 int64
scatter_ratio 1 131 float64
elongatedness 1 35 float64
pr.axis_rectangularity 3 13 float64
max.length_rectangularity 0 66 int64
scaled_variance 3 128 float64
scaled_variance.1 2 422 float64
scaled_radius_of_gyration 2 143 float64
scaled_radius_of_gyration.1 4 39 float64
skewness_about 6 23 float64
skewness_about.1 1 41 float64
skewness_about.2 1 30 float64
hollows_ratio 0 31 int64

Observations

  • There are 19 columns and 846 rows in the dataset
  • The target contains labels for three classes - car, bus and van
  • All independent variables are numerical features
  • There are missing values for few columns, we can impute those using mean and median values based on the skewness
  • Most of the independents variables are distributed near to normal
  • skewness_about and skewness_about.1 have min values as zero, probably they indicate missing values. It would require indepth domain knowledge to make this decision, we would leave these as is.

Lets look at the distribution plot

In [8]:
X.hist(figsize=(20,20))
Out[8]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x0000023E600E1E48>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000023E621615F8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000023E6218BC88>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000023E621BB358>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x0000023E621E09E8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000023E621E0A20>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000023E6223E5F8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000023E62263C88>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x0000023E62293320>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000023E622BC9B0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000023E622ED080>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000023E62312710>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x0000023E6233DDA0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000023E6236D470>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000023E62395B00>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000023E623C71D0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x0000023E623ED860>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000023E62416EF0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000023E624475C0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000023E6246DC50>]],
      dtype=object)

Impute missing values with median

In [9]:
for i in list(X._get_numeric_data().columns):
    X[i].fillna(X[i].median(), inplace=True)
In [10]:
basic_details(X)
Out[10]:
Missing value N unique value dtype
compactness 0 44 int64
circularity 0 27 float64
distance_circularity 0 63 float64
radius_ratio 0 134 float64
pr.axis_aspect_ratio 0 37 float64
max.length_aspect_ratio 0 21 int64
scatter_ratio 0 131 float64
elongatedness 0 35 float64
pr.axis_rectangularity 0 13 float64
max.length_rectangularity 0 66 int64
scaled_variance 0 128 float64
scaled_variance.1 0 423 float64
scaled_radius_of_gyration 0 144 float64
scaled_radius_of_gyration.1 0 40 float64
skewness_about 0 23 float64
skewness_about.1 0 41 float64
skewness_about.2 0 30 float64
hollows_ratio 0 31 int64

Check how the target classes are distributed

In [11]:
sns.countplot(x='class',data=y)
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x23e62f7da90>
In [12]:
y['class'].value_counts()/y.shape[0] * 100
Out[12]:
car    50.709220
bus    25.768322
van    23.522459
Name: class, dtype: float64
In [13]:
y['class'] = pd.Categorical(y['class']).codes
In [14]:
y['class'].value_counts()/y.shape[0] * 100
Out[14]:
1    50.709220
0    25.768322
2    23.522459
Name: class, dtype: float64

There is a slight imbalance in the dataset, observations for 'car' class is higher compared to 'bus' and 'van'. Its good to have a balanced dataset for every class to avoid baised prediction. However, we will use the dataset as is for this analysis and not use any balancing technique.

Bivariate analysis using pair plot and correlation matrix

In [15]:
def correlation_matrix(df):
    corrmat = df.corr()
    top_corr_features = corrmat.index
    plt.figure(figsize=(25,25))
    #plot heat map
    g=sns.heatmap(df[top_corr_features].corr(),annot=True,cmap="RdYlGn")
In [16]:
correlation_matrix(X)
  • max.length_rectangularity is highly correlated with circularity (ρ = 0.96573)
  • scaled_radius_of_gyration is highly correlated with circularity (ρ = 0.93595)
  • scaled_variance is highly correlated with pr.axis_rectangularity (ρ = 0.93818)
  • scaled_variance.1 is highly correlated with scaled_variance (ρ = 0.94977)
  • scatter_ratio is highly correlated with scaled_variance.1 (ρ = 0.99633)

Hence, these columns can be ignored while training the model as they are highly correlated with other variables

Following variables need to be considered for training the model

fcolumns = ['compactness', 'circularity', 'distance_circularity', 'radius_ratio', 'pr.axis_aspect_ratio', 'max.length_aspect_ratio', 'elongatedness', 'pr.axis_rectangularity', 'scaled_radius_of_gyration.1', 'skewness_about', 'skewness_about.1', 'skewness_about.2', 'hollows_ratio'],

Below Dendrogram also depicts the same

In [17]:
from scipy.cluster import hierarchy as hc

corr = 1 - X.corr() 
corr_condensed = hc.distance.squareform(corr) # convert to condensed
z = hc.linkage(corr_condensed, method='complete')
plt.figure(figsize=(20,8))
dendrogram = hc.dendrogram(z,labels=corr.columns,leaf_rotation =90)
In [18]:
sns.pairplot(df,diag_kind='kde')
Out[18]:
<seaborn.axisgrid.PairGrid at 0x23e63194860>

Observation

  • Relationship between variables shows high correlation (positive and negative) between most of the variables.
  • We can see outliers in the data

Oulier Detection

In [19]:
sns.set(context="paper", font="monospace")
# Create a figure instance
fig = plt.figure(1, figsize=(18, 12))

# Create an axes instance
ax = fig.add_subplot(111)

g = sns.boxplot(data=X, ax=ax, color="blue")
g.set_xticklabels(X.columns,rotation=90)

# Add transparency to colors
for patch in g.artists:
    r, g, b, a = patch.get_facecolor()
    patch.set_facecolor((r, g, b, .3))   
       

Impute outliers

The extreme observations in data set which resembles completely different behaviour from the rest of data point are called outliers. The outliers present in numeric feature can be dealt by following ways based on the domain knowledge

  • delete the outliers (there could be loss of data due to this)
  • impute ouliers with mean/median
  • impute with lower and upper bound values.

In our case we will replace the outliers by 1%/99% of feature value, this might not be the best approach.

In [20]:
def outlier(df,columns):
    for i in columns:
        quartile_1,quartile_3 = np.percentile(df[i],[25,75])
        quartile_f,quartile_l = np.percentile(df[i],[1,99])
        IQR = quartile_3-quartile_1
        lower_bound = quartile_1 - (1.5*IQR)
        upper_bound = quartile_3 + (1.5*IQR)
        print(i,lower_bound,upper_bound,quartile_f,quartile_l)
                
        df[i].loc[df[i] < lower_bound] = quartile_f
        df[i].loc[df[i] > upper_bound] = quartile_l    
        
outlier(X,X.columns)
 
compactness 67.5 119.5 79.0 113.0
circularity 26.5 62.5 34.0 57.0
distance_circularity 28.0 140.0 51.0 109.0
radius_ratio 60.0 276.0 111.0 234.54999999999995
pr.axis_aspect_ratio 45.0 77.0 49.45 75.54999999999995
max.length_aspect_ratio 2.5 14.5 4.0 23.649999999999864
scatter_ratio 70.5 274.5 116.0 253.64999999999986
elongatedness 13.5 65.5 26.0 57.549999999999955
pr.axis_rectangularity 13.0 29.0 17.0 28.0
max.length_rectangularity 104.0 192.0 122.0 178.0
scaled_variance 92.0 292.0 135.0 279.0999999999999
scaled_variance.1 -84.5 989.5 198.35 942.2999999999988
scaled_radius_of_gyration 75.5 271.5 116.0 254.0999999999999
scaled_radius_of_gyration.1 55.0 87.0 61.0 89.54999999999995
skewness_about -8.5 19.5 0.0 20.549999999999955
skewness_about.1 -16.0 40.0 0.0 37.09999999999991
skewness_about.2 170.5 206.5 178.0 203.0
hollows_ratio 174.125 217.125 182.0 210.0

Lets have a look at the box plot again after handling outliers.

In [21]:
sns.set(context="paper", font="monospace")
# Create a figure instance
fig = plt.figure(1, figsize=(18, 12))

# Create an axes instance
ax = fig.add_subplot(111)

g = sns.boxplot(data=X, ax=ax, color="blue")
g.set_xticklabels(X.columns,rotation=90)

# Add transparency to colors
for patch in g.artists:
    r, g, b, a = patch.get_facecolor()
    patch.set_facecolor((r, g, b, .3)) 

Inference

From the above visual plots we can see that most of the variables are highly correlated. This correlation between variables brings about a redundancy in the information that can be gathered by the data set. Thus in order to reduce noise (which could lead to the computational complexities in huge datasets) , we will use PCA to transform the original variables to the linear combination of these variables which are independent.

Based on the percentage of variation that we want to be captured in transformed data set, we will select the number of Principal Components to be considered.

Standardise the dataset using Standard Scalar

Whether to standardize the data prior to a PCA on the covariance matrix depends on the measurement scales of the original features. Since PCA yields a feature subspace that maximizes the variance along the axes, it makes sense to standardize the data, especially, if it measured on different scales. Let us transform data onto unit scale (mean=0 and variance=1), which is a requirement for the optimal performance of many machine learning algorithms.

In [22]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

Eigendecomposition - Computing Eigenvectors and Eigenvalues using covariance matrix

The eigenvectors and eigenvalues of a covariance (or correlation) matrix represent the “core” of a PCA: The eigenvectors (principal components) determine the directions of the new feature space, and the eigenvalues determine their magnitude. In other words, the eigenvalues explain the variance of the data along the new feature axes.
In [23]:
cov_mat1 = np.cov(X_std.T)

eig_vals, eig_vecs = np.linalg.eig(cov_mat1)

print('Eigenvectors \n%s' %eig_vecs)
print('\nEigenvalues \n%s' %eig_vals)

eig_vals.sum()
Eigenvectors 
[[-2.73659245e-01  9.62368925e-02 -1.50110684e-01 -1.73438920e-01
  -7.64799190e-02 -3.96076369e-02 -3.32412510e-01 -6.82211513e-01
  -4.07216723e-01  2.80830338e-01  6.34917650e-02  6.14529983e-02
  -1.18378298e-02 -1.34973017e-01 -7.62908893e-02  2.66435277e-03
   5.49305713e-02  1.43109830e-02]
 [-2.89814589e-01 -1.28191787e-01  1.52925214e-01 -1.29118466e-01
   2.60088959e-02 -1.26160911e-01  3.92439139e-01 -1.29284490e-01
   6.58820588e-03 -5.07845800e-02 -4.76857584e-02 -2.87283821e-01
  -1.14549533e-02  4.58068409e-02 -2.22478706e-01 -8.13124423e-02
  -1.87183989e-01  7.00996759e-01]
 [-3.03220331e-01  5.30650428e-02 -9.98326148e-02  1.87457071e-02
   8.81045256e-02 -8.68490587e-02 -1.05862158e-02  3.75869566e-01
  -2.76861635e-01  1.09700513e-01  7.52532781e-01 -1.80439839e-02
   7.20598520e-03  6.13174301e-02  2.21502411e-01  8.11329154e-03
  -1.67735771e-01  4.28309332e-02]
 [-2.72051437e-01  2.01824436e-01  1.78082498e-01  1.74125747e-01
  -5.79846533e-02  2.48535050e-01 -1.29600699e-01  6.78675797e-02
  -1.43166813e-02  1.69735819e-01 -1.12671975e-01 -3.65248757e-01
  -2.71475252e-02  4.58136072e-01 -3.94423478e-01  4.92183137e-02
  -2.96299451e-01 -3.24282809e-01]
 [-9.77586130e-02  2.48081077e-01  5.48016921e-01  2.97449379e-01
  -8.89412806e-02  5.26988429e-01  1.64370979e-01 -8.21711339e-02
  -2.41876255e-01 -1.08954061e-01  2.51096155e-02  1.58478450e-01
   1.83679629e-02 -1.87764312e-01  2.34729383e-01 -1.78102717e-02
   1.42461996e-01  1.15233392e-01]
 [-1.45932573e-01  6.78003054e-02  3.40656908e-01  9.99690277e-02
   7.67018862e-01 -2.39374680e-01 -3.45161311e-01  1.59493692e-02
   1.20778666e-01  5.58478662e-02 -1.52660978e-01  4.54672602e-02
  -3.65749402e-03 -1.41822655e-01  8.55825272e-02  1.78266761e-02
  -1.01679568e-01  4.32426776e-02]
 [-3.13371347e-01 -6.76069936e-02 -1.25191475e-01  4.21270513e-02
  -9.51125418e-02  8.31808318e-03 -9.29117641e-02  1.02949567e-01
  -7.05619561e-03 -1.84080559e-01 -1.77061025e-01  1.32759516e-01
   8.39509877e-01 -4.90982385e-02  4.66731908e-03  2.36555230e-01
  -6.84615948e-02  2.58369000e-02]
 [ 3.11423599e-01  4.64104581e-03  6.13995900e-02 -6.96047680e-02
   1.27391063e-01 -1.77804236e-02  7.94193220e-02 -2.30369914e-01
  -1.01823022e-01  1.95635844e-01 -4.88327053e-02 -8.62350910e-02
   2.40865680e-01  6.45962244e-01  5.17268541e-01  5.86805829e-02
  -2.45007352e-03  1.17024050e-01]
 [-3.10090708e-01 -8.02005995e-02 -1.41274899e-01  2.71473387e-02
  -8.22270454e-02 -1.95163746e-03 -9.56403486e-02  5.67005500e-02
  -6.26574423e-02 -2.09189338e-01 -2.70295920e-01  2.24074627e-01
  -1.09957030e-01  1.80422008e-01  2.64982661e-01 -7.29560784e-01
  -1.93113614e-01 -4.71507532e-02]
 [-2.79916321e-01 -1.19898404e-01  1.48150426e-01 -1.39912110e-01
   1.34148688e-01 -2.29837500e-01  3.69782052e-01 -2.62795527e-01
  -3.94109192e-02 -4.34411807e-01  9.97783720e-02 -2.30062642e-01
  -1.34211545e-02  4.07015197e-02  1.12231052e-01  6.07866018e-02
   2.28719944e-01 -5.15300030e-01]
 [-3.04624073e-01 -6.99424543e-02 -5.57416726e-02  1.14983172e-01
  -7.77240824e-02  8.21043678e-02 -2.44357430e-01  8.54240270e-02
   3.51001510e-01  1.74915042e-01  4.68878605e-02 -3.01863566e-01
   1.01434774e-02  9.38401699e-02  1.19896171e-01 -1.22321182e-01
   7.03836448e-01  1.59512988e-01]
 [-3.09985938e-01 -7.33903049e-02 -1.37782283e-01  3.19861038e-02
  -1.27575853e-01  1.72051531e-02 -1.12121799e-01  5.75159076e-02
   1.81299049e-02 -1.76787682e-01 -2.40143791e-01  1.65718925e-01
  -4.70256186e-01  1.54690083e-01  2.83129333e-01  6.15496655e-01
  -8.34177248e-02  1.35363838e-01]
 [-2.67166566e-01 -2.07648254e-01  1.51997548e-01 -1.41903140e-01
  -5.02082408e-02  8.49892054e-04  3.91702117e-01  2.49855569e-02
   2.64454713e-01  6.50116305e-01 -7.90509136e-02  3.47583045e-01
   1.08582453e-02 -7.55494745e-02  8.18474600e-02  2.50549041e-02
  -6.23452052e-02 -2.21355431e-01]
 [ 3.84098340e-02 -4.97053428e-01  1.00520282e-01  1.36911752e-01
   3.59940170e-02  2.75581547e-01 -1.99277262e-01 -3.46886737e-01
   3.79098344e-01 -1.99002475e-01  4.15093125e-01  2.29349417e-01
   1.06639481e-02  1.66561193e-01 -1.28276916e-01 -2.09222391e-03
  -1.78758964e-01  2.89718211e-02]
 [-4.11976837e-02  4.05059824e-02 -6.23944487e-02 -7.64357771e-01
   2.64643635e-01  5.47772445e-01 -6.31031233e-02  1.48608970e-01
   1.90478143e-02 -9.73513163e-02 -1.84496860e-02 -1.56661517e-02
  -2.52620392e-03  6.97843922e-03 -1.38431893e-02 -5.56413058e-04
   4.07786925e-02  5.72439877e-03]
 [-5.84396392e-02  9.93776320e-02 -6.11550611e-01  3.79789221e-01
   4.36795513e-01  3.30899525e-01  3.53949895e-01 -1.65881226e-01
   4.41942793e-02  5.38632685e-02 -3.86620845e-02 -5.65599076e-02
  -1.26864733e-02 -7.20898386e-02  5.79448253e-03  6.40877240e-03
  -1.85683887e-02 -8.21440627e-03]
 [-3.55612872e-02  5.12228572e-01 -3.33840296e-02 -1.11739400e-01
  -2.10069308e-01 -3.88241437e-02 -3.58223149e-02 -2.15124439e-01
   5.51136108e-01 -5.60886629e-02  1.28734676e-01 -2.02144060e-01
   4.33026311e-02 -2.11970166e-01  3.21970587e-01 -1.99196829e-02
  -3.37946776e-01 -1.22791050e-02]
 [-8.06065172e-02  5.14647903e-01 -1.95119391e-02 -5.55017122e-02
   7.07138915e-02 -1.75872823e-01  1.25041682e-01 -3.99680132e-02
   1.58226849e-01 -1.35871893e-01  1.31076173e-01  5.36908364e-01
  -2.71164164e-03  3.74723240e-01 -3.25031960e-01 -5.76901800e-03
   2.58227884e-01  1.01392223e-01]]

Eigenvalues 
[9.58932069e+00 3.27978156e+00 1.22330781e+00 1.17470743e+00
 9.07747354e-01 7.24274859e-01 3.94195095e-01 2.23015061e-01
 1.62185403e-01 9.68023855e-02 6.63769065e-02 5.20284998e-02
 3.04502671e-03 3.92858270e-02 3.25155033e-02 1.02488584e-02
 2.24480608e-02 2.00154434e-02]
Out[23]:
18.02130177514793
The typical goal of a PCA is to reduce the dimensionality of the original feature space by projecting it onto a smaller subspace, where the eigenvectors will form the axes. However, the eigenvectors only define the directions of the new axis. In order to decide which eigenvector(s) can dropped without losing too much information for the construction of lower-dimensional subspace, we need to inspect the corresponding eigenvalues: The eigenvectors with the lowest eigenvalues bear the least information about the distribution of the data; those are the ones can be dropped.

Explained Variance

To decide "how many principal components are we going to choose for our new feature subspace?" A useful measure is the so-called “explained variance” which can be calculated from the eigenvalues. The explained variance tells us how much information (variance) can be attributed to each of the principal components.
In [24]:
tot = sum(eig_vals)
var_exp = [( i /tot ) * 100 for i in sorted(eig_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
print("Cumulative Variance Explained", cum_var_exp)
Cumulative Variance Explained [ 53.21103222  71.41050306  78.19862429  84.71706252  89.75414234
  93.77313535  95.96051947  97.19802753  98.0979925   98.63514784
  99.00347255  99.29217811  99.51017472  99.69060288  99.81516691
  99.92623238  99.98310318 100.        ]
In [25]:
plt.plot(var_exp)
Out[25]:
[<matplotlib.lines.Line2D at 0x23e73252710>]
In [26]:
# Ploting 
plt.figure(figsize=(10 , 5))
plt.bar(range(1, eig_vals.size + 1), var_exp, alpha = 0.5, align = 'center', label = 'Individual explained variance')
plt.step(range(1, eig_vals.size + 1), cum_var_exp, where='mid', label = 'Cumulative explained variance')
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.xticks(np.arange(19))
plt.legend(loc = 'best')
plt.tight_layout()
plt.show()

The plot above clearly shows that most of the variance (53.21% of the variance to be precise) can be explained by the first principal component, second principal component bears some information (18.2%) and so on. Together, the first 7 principal components contain 95.96% of the information and first 8 principal components contain 97.19% of information. We can choose either 7 or 8 and sample ignore rest of the principal components

In [27]:
# Using scikit learn PCA here. It does all the above steps and maps data to PCA dimensions in one shot
from sklearn.decomposition import PCA

# NOTE - we are generating only 8 PCA dimensions (dimensionality reduction from 17 to 8)

pca = PCA(n_components=10)
data_reduced = pca.fit_transform(X_std)
data_reduced.transpose()
Out[27]:
array([[ 0.47877102, -1.57906668,  3.84439877, ...,  4.96230557,
        -3.30298007, -4.88487976],
       [-0.58779774, -0.33976099,  0.18407394, ..., -0.11703698,
        -1.0403932 ,  0.41251757],
       [ 1.35598237, -0.49602455,  0.57970627, ...,  1.31956559,
        -1.36197926, -1.12346188],
       ...,
       [-0.48457693,  0.17748415,  0.37195089, ..., -0.14476684,
         0.42559936, -0.22792672],
       [-0.81534206,  0.0440285 , -0.28893804, ..., -0.64346111,
        -0.30438893, -0.46255093],
       [-0.04692186, -0.03401882,  0.40918655, ..., -0.50535866,
         0.88257294,  0.34316515]])
In [28]:
df_comp = pd.DataFrame(pca.components_,columns=list(X))
df_comp.head()
Out[28]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio
0 0.273659 0.289815 0.303220 0.272051 0.097759 0.145933 0.313371 -0.311424 0.310091 0.279916 0.304624 0.309986 0.267167 -0.038410 0.041198 0.058440 0.035561 0.080607
1 -0.096237 0.128192 -0.053065 -0.201824 -0.248081 -0.067800 0.067607 -0.004641 0.080201 0.119898 0.069942 0.073390 0.207648 0.497053 -0.040506 -0.099378 -0.512229 -0.514648
2 -0.150111 0.152925 -0.099833 0.178082 0.548017 0.340657 -0.125191 0.061400 -0.141275 0.148150 -0.055742 -0.137782 0.151998 0.100520 -0.062394 -0.611551 -0.033384 -0.019512
3 0.173439 0.129118 -0.018746 -0.174126 -0.297449 -0.099969 -0.042127 0.069605 -0.027147 0.139912 -0.114983 -0.031986 0.141903 -0.136912 0.764358 -0.379789 0.111739 0.055502
4 -0.076480 0.026009 0.088105 -0.057985 -0.088941 0.767019 -0.095113 0.127391 -0.082227 0.134149 -0.077724 -0.127576 -0.050208 0.035994 0.264644 0.436796 -0.210069 0.070714
In [29]:
plt.figure(figsize=(12,6))
sns.heatmap(df_comp,cmap='plasma',)
Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x23e7327e128>

Split the data into train and test

In [30]:
#Test train split
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(data_reduced,y.values.ravel(),test_size=0.2,random_state=3)

Gaussian NaiveBayes

In [31]:
from sklearn.naive_bayes import GaussianNB

from sklearn.model_selection import cross_val_score, cross_val_predict,cross_validate
from sklearn import metrics

nb_clf = GaussianNB()

scoring = {'acc': 'accuracy',
           'prec_macro': 'precision_macro',
           'rec_macro': 'recall_macro'}

scores=cross_validate(nb_clf, X_train,y_train, cv=5,scoring=scoring,return_train_score=True)

print(scores.keys())

print("Cross-validated scores for train accuracy:", scores['train_acc'].mean()) 
print("Cross-validated scores for train precision:", scores['train_prec_macro'].mean()) 
print("Cross-validated scores for train recall:", scores['train_prec_macro'].mean()) 


print("\nCross-validated scores for test accuracy:", scores['test_acc'].mean()) 
print("Cross-validated scores for test precision:", scores['test_prec_macro'].mean()) 
print("Cross-validated scores for test recall:", scores['test_rec_macro'].mean()) 

# Make cross validated predictions
predictions = cross_val_predict(nb_clf, X_train,y_train, cv=5)

# Train the model (a.k.a. `fit` training data to it).
nb_clf.fit(X_train,y_train)
# Use the model to make predictions based on testing data.
y_pred_nb = nb_clf.predict(X_test)

#Compute confusion matrix
from sklearn.metrics import confusion_matrix
cm_nb = confusion_matrix(y_test,y_pred_nb)
cm_nb
dict_keys(['fit_time', 'score_time', 'test_acc', 'train_acc', 'test_prec_macro', 'train_prec_macro', 'test_rec_macro', 'train_rec_macro'])
Cross-validated scores for train accuracy: 0.8657593702862474
Cross-validated scores for train precision: 0.8637478473566975
Cross-validated scores for train recall: 0.8637478473566975

Cross-validated scores for test accuracy: 0.8505079179267063
Cross-validated scores for test precision: 0.8532564027020048
Cross-validated scores for test recall: 0.8263701363285761
Out[31]:
array([[41,  2,  6],
       [ 6, 73,  3],
       [ 1, 10, 28]], dtype=int64)

Support Vector Machine

In [32]:
from sklearn.svm import SVC
classifier_svm_kernel = SVC(C=1.0,kernel='rbf')

scoring = {'acc': 'accuracy',
           'prec_macro': 'precision_macro',
           'rec_macro': 'recall_macro'}

# Perform 5-fold cross validation
svm_scores = cross_validate(classifier_svm_kernel, X_train,y_train, cv=5,scoring=scoring,return_train_score=True)
print(svm_scores.keys())

print("Cross-validated scores for train accuracy:", svm_scores['train_acc'].mean()) 
print("Cross-validated scores for train precision:", svm_scores['train_prec_macro'].mean()) 
print("Cross-validated scores for train recall:", svm_scores['train_prec_macro'].mean()) 


print("\nCross-validated scores for test accuracy:", svm_scores['test_acc'].mean()) 
print("Cross-validated scores for test precision:", svm_scores['test_prec_macro'].mean()) 
print("Cross-validated scores for test recall:", svm_scores['test_rec_macro'].mean()) 

# Make cross validated predictions
predictions = cross_val_predict(classifier_svm_kernel, X_train,y_train, cv=5)

# Train the model (a.k.a. `fit` training data to it).
classifier_svm_kernel.fit(X_train,y_train)
# Use the model to make predictions based on testing data.
y_pred_svm = classifier_svm_kernel.predict(X_test)

#Compute confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_pred_svm)
cm
dict_keys(['fit_time', 'score_time', 'test_acc', 'train_acc', 'test_prec_macro', 'train_prec_macro', 'test_rec_macro', 'train_rec_macro'])
Cross-validated scores for train accuracy: 0.9774375046892798
Cross-validated scores for train precision: 0.9757109944802306
Cross-validated scores for train recall: 0.9757109944802306

Cross-validated scores for test accuracy: 0.9585718466491073
Cross-validated scores for test precision: 0.9569723979526294
Cross-validated scores for test recall: 0.9502989430291221
Out[32]:
array([[48,  0,  1],
       [ 2, 78,  2],
       [ 0,  4, 35]], dtype=int64)
In [33]:
#Comparing the predictions with the actual results
comparison = pd.DataFrame(y_test,columns=['y_test'])
comparison['y_predicted'] = y_pred_svm
comparison.head()
Out[33]:
y_test y_predicted
0 0 0
1 0 0
2 2 2
3 1 1
4 0 0

Fine tune SVM parameters using GridSearchCV

In [34]:
#Applying grid search for optimal parameters and model after k-fold validation
from sklearn.model_selection import GridSearchCV

parameters = [{'C':[0.01,0.05,0.5, 0.1,5.3], 'kernel':['rbf','linear'], 'gamma': [0.01, 0.05,0.1,0.125,0.15, 0.17, 0.5,1]}]
grid_search = GridSearchCV(estimator=classifier_svm_kernel, param_grid=parameters, scoring ='accuracy',cv=5,n_jobs=-1)
grid_search = grid_search.fit(X_train,y_train)
In [35]:
best_accuracy = grid_search.best_score_
best_accuracy
Out[35]:
0.9630177514792899
In [36]:
opt_param = grid_search.best_params_
opt_param
Out[36]:
{'C': 5.3, 'gamma': 0.15, 'kernel': 'rbf'}
In [37]:
y_pred = grid_search.predict(X_test)

#Compute confusion matrix
from sklearn.metrics import confusion_matrix, classification_report
cm = confusion_matrix(y_test,y_pred)
cm
Out[37]:
array([[48,  0,  1],
       [ 2, 80,  0],
       [ 0,  2, 37]], dtype=int64)
In [38]:
print(classification_report(y_test, y_pred, target_names = ['bus', 'car', 'van']))
              precision    recall  f1-score   support

         bus       0.96      0.98      0.97        49
         car       0.98      0.98      0.98        82
         van       0.97      0.95      0.96        39

    accuracy                           0.97       170
   macro avg       0.97      0.97      0.97       170
weighted avg       0.97      0.97      0.97       170

Conclusion

  • Most of the variables in this original dataset are highly correlated. Correlation between variables brings about a redundancy in the information , in order to reduce noise we used PCA to transform the original variables to the linear combination of these variables which are independent.
  • Based on the percentage of variation explained by each principle component, we choose to consider first 10 components as it explains close to 98.63% of variability. That is dimensionality reduction from 17 to 10, ignoring rest of the principle components.

  • Here we can see that we are not loosing much information by transforming the components to a new feature space and we are able to capture most of the variance explained by these new principle components.

  • With SVM we are able to classify the class with a very good precision,recall and f1-scores over Naive bayes classifier
In [ ]: